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Abstract 

Parsing Expression Grammars (PEGs) define languages by 
specifying a recursive-descent parser that recognises them. 
The PEG formalism exhibits desirable properties, such as 
closure under composition, built-in disambiguation, unifica¬ 
tion of syntactic and lexical concerns, and closely matching 
programmer intuition. Unfortunately, state of the art PEG 
parsers struggle with left-recursive grammar rules, which are 
not supported by the original definition of the formalism and 
can lead to infinite recursion under naive implementations. 
Likewise, support for associativity and explicit precedence 
is spotty. To remedy these issues, we introduce Autumn, a 
general purpose PEG library that supports left-recursion, left 
and right associativity and precedence rules, and does so ef¬ 
ficiently. Eurthermore, we identify inhx and postfix opera¬ 
tors as a major source of inefficiency in left-recursive PEG 
parsers and show how to tackle this problem. We also ex¬ 
plore the extensibility of the PEG paradigm by showing how 
one can easily introduce new parsing operators and how our 
parser accommodates custom memoization and error han¬ 
dling strategies. We compare our parser to both state of the 
art and battle-tested PEG and CEG parsers, such as Rats!, 
Parboiled and ANTLR. 

Categories and Subject Descriptors D.3.4 [Program¬ 
ming Languages]: Parsing 

Keywords parsing expression grammar, parsing, 
left-recursion, associativity, precedence 

1. Introduction 

Context Parsing is well studied in computer science. 
There is a long history of tools to assist programmers in 
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this task. These include parser generators (like the venerable 
Yacc) and more recently parser combinator libraries ||2l. 

Most of the work on parsing has been built on top of 
Chomsky’s context-free grammars (CEGs). Lord’s parsing 
expression grammars (PEGs) are an alternative formal¬ 
ism exhibiting interesting properties. Whereas CEGs use 
non-deterministic choice between alternative constructs, 
PEGs use prioritized choice. This makes PEGs unambiguous 
by construction. This is only one of the manifestations of a 
broader philosophical difference between CEGs and PEGs. 

CEGs are generative: they describe a language, and the 
grammar itself can be used to enumerate the set of sen¬ 
tences belonging to that language. PEGs on the other hand 
are recognition-based: they describe a predicate indicating 
whether a sentence belongs to a language. 

The recognition-based approach is a boon for program¬ 
mers who have to find mistakes in a grammar. It also en¬ 
ables us to add new parsing operators, as we will see in sec¬ 
tion |4] These benefits are due to two PEG characteristics. 
(1) The parser implementing a PEG is generally close to the 
grammar, making reasoning about the parser’s operations 
easier. This characteristic is shared with recursive-descent 
CEG parsers. (2) The single parse rule: attempting to match 
a parsing expression (i.e. a sub-PEG) at a given input posi¬ 
tion will always yield the same result (success or failure) and 
consume the same amount of input. This is not the case for 
CEGs. Eor example, with a PEG, the expression (a*) will al¬ 
ways greedily consume all the a’s available, whereas a CEG 
could consume any number of them, depending on the gram¬ 
mar symbols that follow. 

Challenges Yet, problems remain. Eirst is the problem of 
left-recursion, an issue which PEGs share with recursive- 
descent CEG parsers. This is sometimes singled out as a rea¬ 
son why PEGs are frustrating to use iQ . Solutions that do 
support left-recursion do not always let the user choose the 
associativity of the parse tree for rules that are both left- and 
right-recursive; either because of technical limitations 111] or 
by conscious design We contend that users should be 
able to freely choose the associativity they desire. 

Whitespace handling is another problem. Traditionally, 
PEG parsers do away with the distinction between lexing 
and parsing. This alleviates some issues with traditional lex- 
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ing: different parts of the input can now use different lex- 
ing schemes, and structure is possible at the lexical level 
(e.g. nested comments) ||3l. However, it means that whites¬ 
pace handling might now pollute the grammar as well as 
the generated syntax tree. Finally, while PEGs make linear¬ 
time parsing possible with full memoizatiorQ, there is a fine 
balance to be struck between backtracking and memoiza- 
tion |12l ■ Memoization can bring about runtime speedups at 
the cost of memory use. Sometimes however, the run time 
overhead of memoization nullifies any gains it might bring. 

Other problems plague parsing tools of all denomina¬ 
tions. While solutions exist, they rarely coexist in a single 
tool. Error reporting tends to be poor, and is not able to ex¬ 
ploit knowledge held by users about the structure of their 
grammars. Syntax trees often consist of either a full parse 
tree that closely follows the structure of the grammar, or 
data structures built on the fly by user-supplied code (seman¬ 
tic actions). Both approaches are flawed: a full parse tree is 
too noisy as it captures syntactic elements with no seman¬ 
tic meaning, while tangling grammatical constructs and se¬ 
mantic actions (i.e. code) produces bloated and hard-to-read 
grammars. Generating trees from declarative grammar anno¬ 
tations is possible, but infrequent. 

Solution To tackle these issues, we introduce a new pars¬ 
ing library called Autumn. Autumn implements a generic 
PEG parser with selective memoization. It supports left- 
recursion (including indirect and hidden left-recursion) and 
both types of associativity. It also features a new construct 
called expression cluster. Expression clusters enable the 
aforementioned features to work faster in parts of gram¬ 
mars dedicated to postfix and infix expressions. Autumn 
also tackles, to some extent, the problems of whitespace 
handling, error reporting, and syntax tree creation. By alle¬ 
viating real and significant pain points with PEG parsing. 
Autumn makes PEG parsing more practical. 

Structure This paper starts by describing the issues that 
occur when using left-recursion to define the syntax of in¬ 
fix and postfix binary operators (section |2|. Next we will 
describe our solutions to these issues (section O. Then, we 
show the freedom afforded by the PEG paradigm regarding 
extensibility and lay down our understanding of how this 
paradigm may be extended further (section I?]). Einally, we 
compare Autumn to other PEG and CEG parsers (section|5]l 
and review related work (section |6]l before concluding. Be¬ 
cause of space restrictions, we do not review the basics of 
the PEG formalism, but refer to Eord’s original paper j^]. 

2. Problems Caused by Binary Operators 

This section explains why infix and postfix binarjH opera¬ 
tors are a significant pain point in terms of expressivity, per- 

' In this context, memoization means caching the result of the invocation of 
a parsing expression at a given position. 

^Binary should be understood broadly here: n-ary infix operators (such 
as the ternary conditional operator) can be modelled in terms of binary 
operators. 

Parsing Expression Grammars Made Practical 2 


formance, and syntax tree construction. Even more so with 
PEGs, due to their backtracking nature and poor handling 
of left-recursion. These issues motivate many of the features 
supported by our parser library. Our running example is a 
minimalistic arithmetic language with addition, subtraction, 
multiplication and division operating on single-digit num¬ 
bers. Table[T]shows four PEGs that recognise this language, 
albeit with different associativity. They all respect the usual 
arithmetic operator precedence. Grammars (a), (b) and (c) 
are classical PEGs, whose specificities we now examine. 
Grammar (d) exhibits our own expression clusters and repre¬ 
sents our solution to the problems presented in this section. 
No support for left-recursion. The recursive-descent na¬ 
ture of PEGs means that most PEG parsers cannot support 
all forms of left-recursion, including indirect and hidden left 
recursion]! Left-recursion is direct when a rule designates 
a sequence of which the hrst element is a recursive refer¬ 
ence, or when a rule designates a choice which has such a 
sequence as alternate. The reason this type of recursion is 
singled out is that it is easy to transform into a non-left- 
recursive form. Left-recursion is hidden when it might or 
might not occur depending on another parsing expression. 
Eor instance, X = Y1 X can result in hidden left-recursion 
because F? might succeed consuming no input. 

PEG parsers that do not support left-recursion can only 
handle grammars (a) and (b). These parsers are unable to 
produce a left-associative parse of the input. Some tools 
can handle direct left-recursive rules by rewriting them to 
the idiomatic (b) form and re-ordering their parse tree to 
simulate a left-associative parse Hi- 

We argue it is necessary to support indirect and hidden 
left-recursion, so that the grammar author is able to organise 
his grammar as he sees fit. Autumn supports left-recursion 
natively, as will be described in section^ Using expression 
clusters, the associativity for operators that are both left- 
and right-recursive can be selected explicitly by using the 
@left_recur annotation (as shown in grammar (d)). Right- 
associativity is the default, so no annotations are required in 
that case. 

Performance issues in non-memoizing parsers. Gram¬ 
mar (a) is parsed inefficiently by non-memoizing parsers. 
Consider a grammar for arithmetic with L levels of prece¬ 
dence and P operators. In our example, L — P — 2. This 
grammar will parse a single number in 0{{P -P 1)^) ex¬ 
pression invocations (i.e. attempts to match an expression). 
Eor the Java language this adds up to thousands of invoca¬ 
tions to parse a single number. The complexity is somewhat 
amortized for longer arithmetic expressions, but the cost re¬ 
mains prohibitively high. Memoizing all rules in the gram¬ 
mar makes the complexity 0{PL), but this coarse-grained 
solution might slow things down because of the inherent 
memoization overhead: cache lookups can be expensive ||^ . 

^To the best of our knowledge. Autumn is the only available parser to 
support all forms of left-recursion with associativity selection. 
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E = S‘+’ E\S‘-’ E\S 
S = N‘*’ S\N‘/' S\N 
iV = [0 - 9] 


E = S {‘+’ E)* \S{‘-’ E)* \S 
S = N S)* \N {‘/’ S)* \S 
N = [0-9] 


(a) Layered, 
right-associative 


(b) Idiomatic 


B = E‘-I-’S'I ES I S 
S = S‘*’ N\S‘/’ N\N 
iV = [0 - 9] 

(c) Layered, 
left-associative 


E = expr 

—>■ E ‘+’ E @+ @left_recur 
^ E‘- E 

=)■ E E @-t @left_recur 

=tE‘l' E 
^ [0 - 9] @-t 

(d) Autumn expression cluster 


Table 1: 4 PEGs describing a minimal arithmetic language. 
E stands for Expression, S for Summand and N for Number. 
In contrast, parsing a number in grammar (b) is always 
0{PL). Nevertheless, grammar (a) still produces a mean¬ 
ingful parse if the operators are right-associative. Not so for 
grammar (b), which flattens the parse into a list of operands. 

PEG parsers supporting left-recursion can use grammar 
(c), the layered, left-associative variant of grammar (a). Our 
own implementation of left-recursion requires breaking left- 
recursive cycles by marking at least one expression in the 
cycle as left-recursive. This can optionally be automated. 
If the rules are marked as left-recursive, using grammar (c) 
we will parse a single-digit number in 0{PL). If, however, 
the cycle breaker elects to mark the sequence expressions 
corresponding to each operator (e.g. [E ‘-b’ S)) as left- 
recursive, then the complexity is 0{{P -b 1)^). 

Expression clusters (grammar (d)) do enable parsing in 
0{PL) without user intervention or full memoization. This 
is most welcome, since the algorithm we use to handle left- 
recursion does preclude memoization while parsing a left- 
recursive expression. 

Implicit precedence. Grammars (a), (b) and (c) encode 
precedence by grouping the rules by precedence level: op¬ 
erators in S have more precedence than those in E. We say 
such grammars are layered. We believe that these grammars 
are less legible than grammar (d), where precedence is ex¬ 
plicit. In an expression cluster, precedence starts at 0, the 
@H- annotation increments the precedence for the alternate 
it follows, otherwise precedence remains the same. It is also 
easy to insert new operators in expression clusters: simply 
insert a new alternate. There is no need to modify any other 
parsing expression^ 

3. Implementation 

This section gives an overview of the implementation of 
Autumn, and briefly explains how precedence and left- 
recursion handling are implemented. 

^We are talking about grammar evolution here, i.e. editing a grammar. 
Grammar composition is not yet supported by the library. 


3.1 Overview 

Autumn is an open source parsing library written in Java, 
available online at http: //github. com/norswap/autumn. 

The library’s entry points take a PEG and some text to 
parse as input. A PEG can be thought of as a graph of parsing 
expressions. Eor instance a sequence is a node that has edges 
towards all parsing expressions in the sequence. The PEG 
can be automatically generated from a grammar file, or built 
programmatically, in the fashion of parser combinators. 

Similarly, parsing can be seen as traversing the parsing 
expression graph. The order and number of times the chil¬ 
dren of each node are visited is defined by the node’s pars¬ 
ing expression type. Eor instance, a sequence will traverse 
all its children in order, until one fails; a choice will tra¬ 
verse all its children in order, until one succeeds. This be¬ 
haviour is defined by how the class associated to the pars¬ 
ing expression type implements the parse method of the 
ParsingExpression interface. As such, each type of pars¬ 
ing expression has its own mini parsing algorithm. 

3.2 Precedence 

Implementing precedence is relatively straightforward. Eirst, 
we store the current precedence in a global parsing state, 
initialized to 0 so that all nodes can be traversed. Next, we 
introduce a new type of parsing expression that records the 
precedence of another expression. A parsing expression of 
this type has the expression to which the precedence must 
apply as its only child. Its role is to check if the precedence 
of the parsing expression is not lower than the current prece¬ 
dence, failing if it is the case, and, otherwise, to increase the 
current precedence to that of the expression. 

Using explicit precedence in PEGs has a notable pitfall. 
It precludes memoization over (expression, position) pairs, 
because the results become contingent on the precedence 
level at the time of invocation. As a workaround, we can 
disable memoization for parts of the grammar (the default), 
or we can memoize over (expression, position, precedence) 
triplets using a custom memoization strategy. 

3.3 Left-Recursion and Associativity 

To implement left-recursion, we build upon Seaton’s work 
on the Katahdin language |@]. He proposes a scheme to han¬ 
dle left-recursion that can accommodate both left- and right- 
associativity. In Katahdin, left-recursion is strongly tied to 
precedence, much like in our own expression clusters. This 
is not a necessity however, and we offer stand-alone forms 
of left-recursion and precedence in Autumn too. 

Here also, the solution is to introduce a new type of 
parsing expression. This new parsing expression has a single 
child expression, indicating that this child expression should 
be treated as left-recursive. All recursive references must be 
made to the new left-recursive expression. 

Algorithm [T] presents a simplified version of the parse 
method for left-recursive parsing expressions. The algorithm 
maintains two global data structures. Eirst, a map from (po- 
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sition, expression) pairs to parse results. Second, a set of 
blocked parsing expressions, used to avoid right-recursion 
in left-associative parses. A parse result represents any data 
generated by invoking a parsing expression at an input po¬ 
sition, including the syntax tree constructed and the amount 
of input consumed. We call the parse results held in our data 
structure seeds 111] because they represent temporary results 
that can “grow” in a bottom-up fashion. Note that our global 
data structures are “global” (in practice, scoped to the on¬ 
going parse) so that they persist between (recursive) invoca¬ 
tions of the algorithm. Other implementations of the parse 
method need not be concerned with them. 

Let us hrst ignore left-associative expressions. When in¬ 
voking a parsing expression at a given position, the algo¬ 
rithm starts by looking for a seed matching the pair, return¬ 
ing it if present. If not, it immediately adds a special seed that 
signals failure. We then parse the operand, update the seed, 
and repeat until the seed stops growing. The idea is simple: 
on the first go, all left-recursion is blocked by the failure 
seed, and the result is our base case. Each subsequent parse 
allows one additional left-recursion, until we have matched 
all the input that could be. For rules that are both left- and 
right-recursive, the first left-recursion will cause the right- 
recursion to kick in. Because of PEG’S greedy nature, the 
right-recursion consumes the rest of the input that can be 
matched, leaving nothing for further left-recursions. The re¬ 
sult is a right-associative parse. 

Things are only slightly different in the left-associative 
case. Now the expression is blocked, so it cannot recurse, 
except in left position. Our loop still grows the seed, ensur¬ 
ing a left-associative parse. 

The algorithm has a few pitfalls. First, it requires memo- 
ization to be disabled while the left-recursive expression is 
being parsed. Otherwise, we might memoize a temporary 
result. Second, for left-associative expressions, it blocks 
all non-left recursion while we only need to block right- 
recursion. To enable non-right recursion, our implemen¬ 
tation includes an escape hatch operator that inhibits the 
blocked set while its operand is being parsed. This operator 
has to be inserted manually. 

3.4 Expression Clusters 

Expression clusters integrate left-recursion handling with 
precedence. As outlined in section |2l this results in a read¬ 
able, easy-to-maintain and performant construct. 

An expression cluster is a choice where each alternate 
must be annotated with a precedence (recall the @H- anno¬ 
tation from earlier), and can optionally be annotated with an 
associativity. Alternates can additionally be marked as left- 
associative, right-associativity being the default. All alter¬ 
nates at the same precedence level must share the same as¬ 
sociativity, hence it needs to be mentioned only for the first 
alternate. 

Like left-recursive and precedence expressions, expres¬ 
sion clusters are a new kind of parsing expression. Algorithm 
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seeds = {} 
blocked = [] 

parse expr: left-recursive expression at position: 
if seeds[position] [expr] exists then 
return seeds [position] [expr] 

else if blocked contains expr then 
return failure 

cuiTent = failure 
seeds [position] [expr] = failure 
if expr is left-associative then 
blocked, add(expr) 

repeat 

result = parse(expr.operand) 
if result consumed more input than current then 
current = result 
seeds [position] [expr] = result 


else 


remove seeds[position][expr] 
if expr is left-associative then 
blocked.remove(expr) 

return current 


Algorithm 1: Left-recursion and associativity handling. 


|2] describes the parse method of expression clusters. The 
code presents a few key differences with respect to the regu¬ 
lar left-recursion parsing algorithm. We now maintain a map 
from cluster expressions to their current precedence. We it¬ 
erate over all the precedence groups in our cluster, in de¬ 
creasing order of precedence. For each group, we verify that 
the group’s precedence is not lower than the current prece¬ 
dence. If not, the current precedence is updated to that of the 
group. We then iterate over the operators in the group, try¬ 
ing to grow our seed. After growing the seed, we retry all 
operators in the group/row the beginning. Note that we can 
do away with the blocked set: left-associativity is handled 
via the precedence check. For left-associative groups, we in¬ 
crement the precedence by one, forbidding recursive entry 
in the group. Upon finishing the invocation, we remove the 
current precedence mapping only if the invocation was not 
recursive: if it was, another invocation is still making use of 
the precedence. 

4. Customizing Parser Behaviour 

4.1 Adding New Parsing Expression Types 

The core idea of Autumn is to represent a PEG as a graph 
of parsing expressions implementing a uniform interface. 
By implementing the ParsingExpression interface, users 
can create new types of parsing expressions. Many of the 
features we will introduce in this section make use of this 
capability. 

Restrictions The only restriction on custom parsing ex¬ 
pressions is the single parse rule: invoking an expression at 
a given position should always yield the same changes to the 
parse state. Custom expressions should follow this rule, and 
ensure that they do not cause other expressions to violate it. 
This limits the use of global state to influence the behaviour 
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1 seeds = {} 

2 precedences = {} 

3 parse expr: cluster expression at position: 

4 if seeds[position] [expr] exists then 

5 return seeds [position] [expr] 


6 

7 

8 
9 

10 

11 


current = failure 

seeds [position] [expr] = failure 

min.precedence = precedences[expr] if defined, else 0 
loop: for group in expr.groups do 

if group.precedence < min.precedence then 
break 


12 

13 

14 

15 

16 

17 

18 
19 


precedences [expr] = group.precedence + 
group.left.associative ? 1 : 0 
for op in group.ops do 
result = parse(op) 

if result consumed more input than current then 
current = result 
seeds [position] [expr] = result 
goto loop 


20 

21 

22 

23 


remove seeds[position][expr] 
if there is no other ongoing invocation of expr then 
remove precedences [expr] 

return cun'ent 


Algorithm 2: Parsing with expression clusters. 


of sub-expressions. Respecting the rule makes memoization 
possible and eases reasoning about the grammar. 

The rule is not very restrictive, but it does preclude the 
user from changing the way other expressions parse. This is 
exactly what our left-recursion and cluster operators do, by 
blocking recursion. We get away with this by blocking mem¬ 
oization when using left-recursion or precedence. There is 
a workaround; use a transformation pass to make modified 
copies of sub-expressions. Experimenting with it was not 
one of our priorities, as experience shows that super-linear 
parse times are rare. In practice, the fact that binary opera¬ 
tors are exponential in the number of operators (while still 
linear in the input size) is a much bigger concern, which is 
adequately addressed by expression clusters. 

Extending The Parse State To be practical, custom pars¬ 
ing expressions may need to define new parsing states, or 
to annotate other parsing expressions. We enable this by en¬ 
dowing parsing expressions, parsers and parse states with an 
extension object: essentially a fast map that can hold arbi¬ 
trary data. There are also a few hooks to the library’s inter¬ 
nals. Our design objective was to allow most native operators 
to be re-implemented as custom expressions. Since many of 
our features are implemented as parsing expressions, the re¬ 
sult is quite flexible. 

4.2 Grammar Instrumentation 

Our library includes facilities to transform the expression 
graph before starting the parse. Transformations are speci¬ 
fied by implementing a simple visitor pattern interface. This 
can be used in conjunction with new parsing expression 
types to instrument grammars. In particular, we successfully 


used custom parsing expression types to trace the execution 
of the parser and print out debugging information. 

We are currently developing a grammar debugger for Au¬ 
tumn and the same principle is used to support breakpoints: 
parsing expressions of interest are wrapped in a special pars¬ 
ing expression that checks whether the parse should proceed 
or pause while the user inspects the parse state. 

Transforming expression graphs is integral to how Au¬ 
tumn works: we use such transformations to resolve recur¬ 
sive reference and break left-recursive cycles in grammars 
built from grammar files. 

4.3 Customizable Error Handling & Memoization 

Whenever an expression fails. Autumn reports this fact to 
the configured error handler for the parse. The default error 
reporting strategy is to track and report the farthest error 
position, along with some contextual information. 

Memoization is implemented as a custom parsing expres¬ 
sion taking an expression to memoize as operand. When¬ 
ever the memoization expression is encountered, the current 
parse state is passed to the memoization strategy. The default 
strategy is to memoize over (expression, position) pairs. Cus¬ 
tom strategies allow using memoization as a bounded cache, 
discriminating between expressions, or including additional 
parse state in the key. 

4.4 Syntax Tree Construction 

In Autumn, syntax trees do not mirror the structure of the 
grammar. Instead, an expression can be captured, meaning 
that a node with a user-supplied name will be added in 
the syntax tree whenever the expression succeeds. Nodes 
created while parsing the expression (via captures on sub¬ 
expressions) will become children of the new node. This 
effectively elides the syntax tree and even allows for some 
nifty tricks, such as flattening sub-trees or unifying multiple 
constructs with different syntax. The text matched by an 
expression can optionally be recorded. Captures are also 
implemented as a custom parsing expression type. 

4.5 Whitespace Handling 

The parser can be configured with a parsing expression to 
be used as whitespace. This whitespace specification is tied 
to token parsing expressions, whose foremost effect is to 
skip the whitespace that follows the text matched by their 
operand. A token also gives semantic meaning: it represents 
an indivisible syntactic unit. The error reporting strategy can 
use this information to good effect, for instance. 

We mentioned earlier that we can record the text matched 
by an expression. If this expression references tokens, the 
text may contain undesirable trailing whitespace. To avoid 
this, we make Autumn keep track of the furthest non¬ 
whitespace position before the current position. 

5. Evaluation 

In Table |2l we measure the performance of parsing the 
source code of the Spring framework (^ 34 MB of Java 
code) and producing matching parse trees. The measure- 
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Parser 


Time (Single) Time (Iterated) Memory 


Autumn 

13.17 s 

12.66 s 

6 154 KB 

Mouse 

101.43 s 

99.93 s 

45 952 KB 

Parboiled 

12.02 s 

11.45 s 

13 921 KB 

Rats! 

5.95 s 

2.41 s 

10 632 KB 

ANTLR v4 (Java 7) 

4.63 s 

2.31 s 

44 432 KB 


Table 2: Performance comparison of Autumn to other PEG 
parsing tools as well as ANTLR. Measurements done over 
34MB of Java code. 

ments were taken on a 2013 MacBook Pro with a 2.3GHz 
Intel Core i7 processor, 4GB of RAM allocated to the Java 
heap (Java 8, client VM), and an SSD drive. The Time (Sin¬ 
gle) column reports the median of 10 task runs in separate 
VMs. The Time (Iterated) column reports the median of 10 
task runs inside a single VM, after discarding 10 warm-up 
runs. The reported times do not include the VM boot time, 
nor the time required to assemble the parser combinators 
(when applicable). For all reported times, the average is al¬ 
ways within 0.5s of the median. All files are read directly 
from disk. The Memory column reports the peak memory 
footprint, defined as the maximum heap size measured after 
a GC activation. The validity of the parse trees was verified 
by hand over a sampling of all Java syntactical features. 

The evaluated tools are Autumn; Rats! 101, a state of the 
art packrat PEG parser generator with many optimizations; 
Parboiled, a popular Java/Scala PEG parser combinator li¬ 
brary; Mouse 1^, a minimalistic PEG parser generator that 
does not allow memoization; and, for comparison, ANTLR 

10] a popular and efficient state of the art CFG parser. 

Results show that Autumn’s performance is well within 
the order of magnitude of the fastest parsing tools. This 
is encouraging, given that we did not dedicate much effort 
to optimization yet. Many optimizations could be applied, 
including some of those used in Rats! |0]. Each parser was 
evaluated with a Java grammar supplied as part of its source 
distribution. For Autumn, we generated the Java grammar 
by automatically converting the one that was written for 
Mouse. We then extracted the expression syntax into a big 
expression cluster and added capture annotations. The new 
expression cluster made the grammar more readable and is 
responsible for a factor 3 speedup of the parse with Autumn 
(as compared to Autumn without expression clusters). 


6. Related Work 

Feature-wise, some works have paved the wa y fo r full left- 
recursion and precedence handling. OMeta ll2|] is a tool 
for pattern matching over arbitrary data types. It was the 
first tool to implement left-recursion for PEGs, albeit allow¬ 
ing only right-associative parses. Katahdin |0] is a language 
whose syntax and semantics are mutable at run-time. It pio¬ 
neers some of the techniques we successfully deployed, but 
is not a parsing tool per se. IronMeta is a port of OMeta to 
C# that supports left-recursion using an algorithm developed 
by Medeiros et al. 10]. This algorithm enables left-recursion. 


associativity and precedence by compiling parsing expres¬ 
sions to byte code for a custom virtual machine. However, 
Iron Meta doesn’t support associativity handling. 

7. Conclusion 

Left-recursion, precedence and associativity are poorly sup¬ 
ported by PEG parsers. Infix and postfix expressions also 
cause performance issues in left-recursion-capable PEG 
parsers. To solve these issues, we introduce Autumn, a 
parsing library that handles left-recursion, associativity and 
precedence in PEGs, and makes it efficient through a con¬ 
struct called expression cluster. Autumn’s performance is 
on par with that of both state of the art and widely used 
PEG parsers. Autumn is built with extensibility in mind, and 
makes it easy to add custom parsing expressions, memoiza¬ 
tion strategies and error handlers. It offers lightweight solu¬ 
tions to ease syntax tree construction, whitespace handling 
and grammar instrumentation. In conclusion. Autumn is a 
practical parsing tool that alleviates significant pain points 
felt in current PEG parsers and constitutes a concrete step 
towards making PEG parsing practical. 
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